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Abstract 

We show that a neural network with arbitrary depth and non-linearities, with 
dropout applied before every weight layer, is mathematically equivalent to an ap¬ 
proximation to a well known Bayesian model. This interpretation might offer an 
explanation to some of dropout’s key properties, such as its robustness to over- 
htting. Our interpretation allows us to reason about uncertainty in deep learning, 
and allows the introduction of the Bayesian machinery into existing deep learning 
frameworks in a principled way. 

This document is an appendix for the main paper “Dropout as a Bayesian Approx¬ 
imation: Representing Model Uncertainty in Deep Learning” by Gal and Ghahra¬ 
mani, 2015 (http ://arxiv.org/ab s/1506.02142). 


1 Introduction 


Deep learning works very well in practice for many tasks, ranging from image processing 
[Krizhevsky et al. 20121 to language modelling [Bengio et al. 20061. However the framework 
has some major limitations as well. Our inability to reason about uncertainty over the features is an 
example of such - the features extracted from a dataset are often given as point estimates. These 
do not capture how much the model is conhdent in its estimation. On the other hand, probabilistic 
Bayesian models such as the Gaussian process [Rasmussen and Williams|[2006| offer us the ability 
to reason about our conhdence. But these often come with a price of lessened performance. 


Another major obstacle with deep learning techniques is over-htting. This problem has been largely 
answered with the introduction of dropout [ Hinton et al.[ 2012[ Srivastava et al. 20141. Indeed many 
modern models use dropout to avoid over-fitting in practice. Over the last several years many have 
tried to explain why dropout helps in avoiding over-fitting, a property which is not often observed 
in Bayesian models. Papers such as | |Wager et al. 2013[ Baldi and Sadowski 2013| have suggested 
that dropout performs stochastic gradient descent on a regularised error function, or is equivalent to 
an L 2 regulariser applied after scaling the features by some estimate. 


Here we show that a deep neural network (NN) with arbitrary depth and non-linearities, with dropout 
applied before every weight layer, is mathematically equivalent to an approximation to the proba¬ 
bilistic deep Gaussian process model j jDamianou and Lawrence|[2013| (marginalised over its covari¬ 
ance function parameters). We would like to stress that no simplifying assumptions are made on the 
use of dropout in the literature, and that the results derived are applicable to any network architecture 
that makes use of dropout exactly as it appears in practical applications. We show that the dropout 
objective, in effect, minimises the Kullback-Leibler divergence between an approximate distribution 
and the posterior of a deep Gaussian process (marginalised over its finite rank covariance function 
parameters). 


We survey possible applications of this new interpretation, and discuss insights shedding light on 
dropout’s properties. This interpretation of dropout as a Bayesian model offers an explanation to 
some of its properties, such as its ability to avoid over-fitting. Further, our insights allow us to treat 
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NNs with dropout as fully Bayesian models, and obtain uncertainty estimates over their features. In 
practice, this allows the introduction of the Bayesian machinery into existing deep learning frame¬ 
works in a principled way. Lastly, our analysis suggests straightforward generalisations of dropout 
for future research which should improve on current techniques. 

The work presented here is an extensive theoretical treatment of the above, with applications studied 
separately. We next review the required background, namely dropout, Gaussian processes, and 
variational inference. We then derive the main results of the paper. We finish with insights and 
applications, and discuss how various dropout variants fit within our framework. 


2 Background 


We review dropout, the Gaussian process modeQ and approximate variational inference quickly. 
These tools will be used in the following section to derive the main results of this work. We use the 
following notation throughout the paper. Bold lower case letters (x) denote vectors, bold upper case 
letters (X) denote matrices, and standard weight letters (a:) denote scalar quantities. We use sub¬ 
scripts to denote either entire rows / columns (with bold letters, x^), or specific elements {Xij). We 
use subscripts to denote variables as well (such as Wi : Q x iT, W 2 : K x D), with corresponding 
lower case indices to refer to specific rows / columns (w^, for the first variable and w^, for 
the second). We use a second subscript to denote the element index of a specific variable: Wi^qk 
denotes the element at row q column k of the variable W 1 . 


2.1 Dropout 


We review the dropout NN model | Hinton et al. 2012 [ Srivastava et al. 2014 ) quickly for the case of 
a single hidden layer NN. This is done for ease of notation, and the generalisation to multiple layers 
is straightforward. Denote by Wi, W 2 the weight matrices connecting the first layer to the hidden 
layer and connecting the hidden layer to the output layer respectively. These linearly transform 
the layers’ inputs before applying some element-wise non-linearity a{-). Denote by b the biases 
by which we shift the input of the non-linearity. We assume the model to output D dimensional 
vectors while its input is Q dimensional vectors, with K hidden units. Thus Wi is a Q x IT matrix, 
W2 is a IT X D matrix, and b is a iT dimensional vector. A standard NN model would output 


= tT(xWi -f b)W 2 given some input xj^ 


Dropout is applied by sampling two binary vectors zi, Z 2 of dimensions Q and K respectively. The 
elements of the vectors are distributed according to a Bernoulli distribution with some parameter 
Pi G [0,1] for i = 1,2. Thus zi ^ ~ Bernoulli(pi) for q = 1, and Z 2 .k ~ Bernoulli(p 2 ) 
for fc = Given an input x, 1 — pi proportion of the elements of the input are set to 

zero: x o zi where o signifies the Hadamard product. The output of the first layer is given by 
(t((x o zi)Wi -f b) o z2, which is linearly transformed to give the dropout model’s output y = 
((tT( (xozi)Wi -|-b)) 0 Z 2 ) W 2 . This is equivalent to multiplying the weight matrices by the binary 
vectors to zero out entire rows: 

y = cr(x(ziWi) -t- b)(z 2 W 2 ). 

The process is repeated for multiple layers. Note that to keep notation clean we will write zi when 
we mean diag(zi) with the diag(-) operator mapping a vector to a diagonal matrix whose diagonal 
is the elements of the vector. 


To use the NN model for regression we might use the Euclidean loss (also known as “square loss”), 

1 ^ 
n—1 

where {yi,..., y^v} are N observed outputs, and {yi,..., y^v} being the outputs of the model with 
corresponding observed inputs {xi,..., x^v}. 

To use the model for classification, predicting the probability of x being classified with label 1, 
we pass the output of the model y through an element-wise softmax function to obtain normalised 
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For a full treatment of Gaussian processes, see Rasmussen and Williams] j2006|. 

Note that we omit the outer-most bias term as this is equivalent to centring the output. 
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scores: pnd = exp{ynd)/ {J2d' Taking the log of this function results in a s oftmax loss, 

1 ^ 

E = ( 2 ) 

n—1 

where c„ G [1, 2, is the observed class for input n. 

During optimisation a regularisation term is often added. We often use L 2 regularisation weighted 
by some weight decay A (alternatively, the derivatives might be scaled), resulting in a minimisation 
objective (often referred to as cost), 

/:dropout := i? + Ai 11 Wi 112 + A 2 11W 2 1 li + A 3 1 |b| li (3) 

We sample new realisations for the binary vectors for every input point and every forward pass 
thorough the model (evaluating the model’s output), and use the same values in the backward pass 
(propagating the derivatives to the parameters). 

The dropped weights ziWi and Z 2 W 2 are often scaled by ^ to maintain constant output magni¬ 
tude. At test time no sampling takes place. This is equivalent to initialising the weights with 
scale ^ with no further scaling at training time, and at test time scaling the weights by 

We will show that equations Q to 0 arise in Gaussian process approximation as well. But first, we 
introduce the Gaussian process model. 

2.2 Gaussian Processes 


The Gaussian process (GP) is a powerful tool in statistics that allows us to model distributions 
over functions. It has been applied in both the supervised and unsupervised domains, for both 
regression and classification tasks | Rasmussen and Williams 2006 [ Titsias and Lawrence 2010 


|Gal et akj |2015| . The Gaussian process offers desirable properties such as uncertainty estimates 
over the function values, robustness to over-fitting, and principled ways for hyper-parameter tuning. 
The use of approximate variational inference for the model allows us to scale it to large data via 
stochastic and distributed inference QHensman et al.[[2013[|Gal et al.||2014| . 


Given a training dataset consisting of N inputs {xi,...,xjv} and their corresponding outputs 
{yi,... jYn}, we would like to estimate a function y = f(x) that is likely to have generated 
our observations. We denote the inputs X G and the outputs Y G 


What is a function that is likely to have generated our data? Following the Bayesian approach we 
would put some prior distribution over the space of functions p(f). This distribution represents 
our prior belief as to which functions are more likely and which are less likely to have generated 
our data. We then look for the posterior distribution over the space of functions given our dataset 
(X,Y): 

p(f|X,Y)cxp(Y|X,f)p(f). 

This distribution captures the most likely functions given our observed data. 


By modelling our distribution over the space of functions with a Gaussian process we can analyti¬ 
cally evaluate its corresponding posterior in regression tasks, and estimate the posterior in classifi¬ 
cation tasks. In practice what this means is that for regression we place a joint Gaussian distribution 
over all function values, 

F|X^AA(0,K(X,X)) (4) 

Y\¥ 

with some precision hyper-parameter r and where Iat is the identity matrix with dimensions N x N. 
For classihcation we sample from a categorical distribution with probabilities given by passing Y 
through an element-wise softmax, 

F|X~AA(0,K(X,X)) (5) 

Y|F~AA(F,0-Iw) 

c„ I Y ~ Categorical ^exp(y„d)/ ^^exp(y„d') 

for n = 1,..., iV with observed class label c„. Note that we did not simply write Y = F because of 
notational convenience that will allow us to treat regression and classification together. 
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To model the data we have to choose a covariance function K(Xi, X 2 ) for the Gaussian distribution. 
This function defines the (scalar) similarity between every pair of input points K(xi, Xj). Given a 
finite dataset of size JV this function induces an iV x X covariance matrix which we will denote 
K := K(X, X). For example we may choose a stationary squared exponential covariance function. 
We will see below that certain non-stationary covariance functions correspond to TanN (hyperbolic 
tangent) or ReLU (rectified linear) NNs. 

Evaluating the Gaussian distribution above involves an inversion of an iV by X matrix, an operation 
that requires 0{N^) time complexity. Many approximations to the Gaussian process result in a 
manageable time complexity. Variational inference can be used for such, and will be explained next. 

2.3 Variational Inference 

To approximate the model above we could condition the model on a finite set of random variables 
LJ. We make a modelling assumption and assume that the model depends on these variables alone, 
making them into sufficient statistics in our approximate model. 

The predictive distribution for a new input point x* is then given by 

p(y*|x*,X,Y)= J p{y*\x*,u:)p{L,\X,Y) du;, 

with y* G The distribution p(a;|X, Y) cannot usually be evaluated analytically. Instead we 
define an approximating variational distribution q{ui), whose structure is easy to evaluate. 

We would like our approximating distribution to be as close as possible to the posterior distribution 
obtained from the full Gaussian process. We thus minimise the Kullback-Leibler (KL) divergence, 
intuitively a measure of similarity between two distributions: 

KL(g(a;) |p(u;|X,Y)), 

resulting in the approximate predictive distribution 

g(y*|x*) = J p{y*\x*,u})q{u})du}. ( 6 ) 


Minimising the Kullback-Leibler divergence is equivalent to maximising the log evidence lower 
I Bishop 2006|, 


C-m '■= y 9 (‘^)logp(Y|X,a;)da;-KL(( 7 (a;)||p(u;)) 


(7) 


with respect to the variational parameters defining < 7 ( 0 ;). Note that the KL divergence in the last 
equation is between the approximate posterior and the prior over u}. Maximising this objective will 
result in a variational distribution q{Lo) that explains the data well (as obtained from the first term— 
the log likelihood) while still being close to the prior—preventing the model from over-fitting. 


We next present a variational approximation to the Gaussian process extending on | Gal and Turner 
2015|, which results in a model mathematically identical to the use of dropout in arbitrarily struc 


tured NNs with arbitrary non-linearities. 


3 Dropout as a Bayesian Approximation 


We show that deep NNs with dropout applied before every weight layer are mathematically equiva 
lent to approximate variational inference in the deep Gaussian process (marginalised over its covari 
ance function parameters). Lor this we build on previous work | |Gal and Turner| 20151 that applied 
variational inference in the sparse spectrum Gaussian process approximation ||Lazaro-Gredilla et al.' 


|2010[ . Starting with the full Gaussian process we will develop an approximation that will be shown 
to be equivalent to the NN optimisation objective with dropout (eq. ([^) with either the Euclidean 
loss (eq. Q) in the case of regression or softmax loss (eq. (|^) in the case of classification. This 
view of dropout will allow us to derive new probabilistic results in deep learning. 


4 



















3.1 A Gaussian Process Approximation 


We begin by defining our covariance function. Let a{-) be some non-linear function such as the 
rectified linear (ReLU) or the hyperbolic tangent function (TanH). We define K(x, y) to be 

K(x, y)= / p(w)p(6)cr(w^x-b 6)(T(w^y-f &)dwd6 


with p(w) a standard multivariate normal distribution of dimensionality Q an d some distribution 
p{b). It is trivial to show that this defines a valid covariance function following jTsuda et'aL 2002) . 


We use Monte Carlo integration with K terms to approximate the integral above. This results in the 
finite rank covariance function 

1 ^ 

K(x, y) = ^ ^ + ^fc)cr(w^y + bk) 

fc=i 

with wj; ^ p(w) and bk ^ p{b)- K will be the number of hidden units in our single hidden layer 
NN approximation. 


Using K instead of K as the covariance function of the Gaussian process yields the following gen¬ 
erative model; 

Wfc '^p(w), bk ^p{b), 

Wi = [wk]k=i,h= [bk]k=i 
1 ^ 

K(x, y) = ^ 51 ^ + bk)<j(wly + bk) 

k^l 

F|X,Wi,b-Ar(0,K(X,X)) 

Y|F^AA(F,r-il^), (8) 

with Wi a Q X AT matrix parametrising our covariance function. 


Integrating over the covariance function parameters results in the following predictive distribution: 

p(Y|X)= J p(Y|F)p(F|Wi,b,X)p(WiMb) 

where the integration is with respect to F, Wi, and b. 

Denoting the 1 x AT row vector 

^(x,Wi,b) = y ^cr(Wfx-)-b) 

and the N x K feature matrix $ = [</)(x„, Wi, we have K(X, X) = <I>$^. We rewrite 

p(Y|X) as 

p(Y|X) = jN{Y- 0, -b T-ilAr)p(Wi)p(b)dWidb, 

analytically integrating with respect to F. 

The normal distribution of Y inside the integral above can be written as a joint normal distribution 
over Yd, the d’th columns of the N x D matrix Y, for d = 1,..., D. For each term in the joint 
distribution, following identity | |Bishop| |2006| page 93], we introduce a AT x 1 auxiliary random 
variable ~ A/’(0, Ik), 

A/'(yd;0, -bT“^l7v) = y A/'(yd; ^'w^, T”^l7v)A/'(wd; 0, I^jdw^. (9) 

Writing W 2 = [wd]d=i a AT X D matrix, the above is equivalent tc[^ 

p(Y|X) = y'p(Y|X,Wi,W2,b)p(Wi)p(W2)p(b) 
where the integration is with respect to Wi, W 2 , and b. 


^This is equivalent to the weighted basis function interpretation of the Gaussian process (Rasmussen an d 
Williams 2006) where the various quantities are analytically integrated over. 
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We have re-parametrised the GP model and marginalised over the additional auxiliary random vari¬ 
ables Wi, W 2 , and b. We next approximate the posterior over these variables with appropriate 
approximating variational distributions. 

3.2 Variational Inference in the Approximate Model 

Our sufficient statistics are Wi,W 2 , and b. To perform variational inference in our approximate 
model we need to define a variational distribution ( 7 (Wi, W 2 , b) := ( 7 (Wi)q(W 2 )(j'(y. We define 
qCWi) to be a Gaussian mixture distribution with two components, factorised over QIj 

Q 

9 (Wi) = g(wg), (10) 

q=l 

9(w,) = ct'^Ik) + (1 - Pi)A/'(0, (t'^Ik) 

with some probability pi G [0,1], scalar cr > 0 and G We put a similar approximating 
distribution over W 2 : 

K 

g(W 2 ) = g(wfc), ( 11 ) 

k^l 

g(wfc) = p2Af{mk, ct^Id) -f (1 - P2)A/'(0, (tHd) 

with some probability p 2 G [0,1]. 

We put a simple Gaussian approximating distribution over b; 

g(b) =7V'(m, (12) 

Next we evaluate the log evidence lower bound for the task of regression, for which we optimise 
over the variational parameters Mi = M 2 = and m, to maximise Eq. Q. The 

task of classification is discussed later. 


3.3 Evaluating the Log Evidence Lower Bound for Regression 

We need to evaluate the log evidence lower bound; 

rcp-vi ■■= J g(Wi, W 2 , b) logp(Y|X, AVi, W 2 , b) - KL(g(AVi, W 2 , b)||p(Wi, AV 2 , b)), 


(13) 


where the integration is with respect to Wi, W 2 , and b. 

For the task of regression we can rewrite the integrand as a sum: 

D 

logp(Y|X, Wi,^V 2 , b) = ^ logAA(yd; r-^Iiv) 


D 


ND . , ND . , ^ ■/ 11 ||0 

=-^ log(27r) -f — log(T) - ^ - llVd - $Wd|| 2 , 

as the output dimensions of a multi-output Gaussian process are assumed to be independent. Denote 
Y = $W 2 . We can then sum over the rows instead of the columns of Y and write 


D 




AT 


;^l|yd -ydll2 = Y tIIv" -y 


|2 
M2* 


Here = ^(x^, Wi, b)W 2 = y ;^a(x„Wi -h b)W 2 , resulting in the integrand 

N 

logp(Y|X, Wi, AV 2 , b) = ^ logAA(y„; Wi, b)W 2 , 

n—1 


"'Note that this is a bi-modal distribution defined over each output dimensionality; as a result the joint 
distribution over Wi is highly multi-modal. 
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This allows us to write the log evidence lower bound as 


N 


/ 9(Wi,W2,b)logp(y„|x„,Wi,W2,b)-KL(g(Wi,W2,b)||p(Wi,W2,b)). 


We re-parametrise the integrands in the sum to not depend on Wi, W 2 , and b directly, but instead 
on the standard normal distribution and the Bernoulli distribution. Let q{ei) = JV{0 ,Iqxk) and 
g(zi,q) = Bernoulli(pi) for q = I, and 9 ( 62 ) = Af{0,lKxD) and q{z 2 ,k) = Bernoulli(p 2 ) 
for k = 1,K. Finally let q{e) = Af{0, T-k)- We write 

Wi = zi(Mi + crei) + (1 - zi)(Tei, 

W 2 = Z 2 (M 2 + cre 2 ) + (1 - 22 ) 0 - 62 , 

b = m + ere, (14) 

allowing us to re-write the sum over the integrals in the above equation as 

TV 

^ / 9(Wi,W2,b)logp(y„|x„,Wi,W2,b)dWidW2db 

N 

= X! / 9 (zi,ei,Z 2 ,e 2 ,e)logp(y„|x„,Wi(zi,ei),W 2 (z 2 ,e 2 ),b(e)) 
where each integration is over ei, zi, £ 2 , Z 2 , and e. 


We estimate each integral using Monte Carlo integration with a distinct single sample to obtain; 

N 

/:gp-mc := 51 log Piynl^n , W”, W2", b’^) - KL{q{W i , W2, b) I |p(Wi, W2, b)) 

n—1 


with realisations W",W 2 ,b” defined following eq. ([T^ with e" ~ Af{0,1 qxk), z"q 


Ber-noullijpi), 62 ~ A/'(0,I^xd)^ ^^idz^~ Bernoulli(^ 2 )- Following j 

Blei et al. ^OU 

Hoffman 

et al. 

2013||Kingma and Welling| 2013 Rezende et al.||2014| Titsias and Lazaro-Gredill 

a 2014), 


optimising the stochastic objective £gp-mc we would converge to the same limit as £gp-vi- 


We can’t evaluate the KL divergence term between a mixture of Gaussians and a single Gaussian 
analytically. However we can perform Monte Carlo integration like in the above. A further approxi¬ 
mation for large K (number of hidden units) and small cr^ yields a weighted sum of KL divergences 
between the mixture components and the single Gaussian (proposition in the appendix). Intu¬ 
itively, this is because the entropy of a mixture of Gaussians with a large enough dimensionality 
and randomly distributed means tends towards the sum of the Gaussians’ volumes. Following the 
proposition, for large enough K we can approximate the KL divergence term as 

Q 

KL((7(Wi)||p(Wi)) « QKier^ - \og{eT^) - 1) + + C 

9=1 

with C constant w.r.t. our parameters, and similarly for KL(( 7 (W 2 )||p(W 2 )). The term 
KL(g(b)||p(b)) can be evaluated analytically as 

KL(< 7 (b)||p(b)) = i(m^m + K{eT^ - log(^") - 1)) + C 
with C constant w.r.t. our parameters. We drop the constants for brevity. 


Next we explain the relation between the above equations and the equations brought in section 2.1 


3.4 Log Evidence Lower Bound Optimisation 

Ignoring the constant terms t, cr we obtain the maximisation objective 

/^GP-MC y"ll 2 - y IIM1II2 - y IIM2II2 - ^||m||2. (15) 

n—1 

Note that in the Gaussian processes literature the terms r, cr will often be optimised as well. 

Letting cr tend to zero, we get that the KL divergence of the prior blows-up and tends to infinity. 
However, in real-world scenarios setting cr to be machine epsilon (10“^^ for example in quadruple 
precision decimal systems) results in a constant value logcr = —76. With high probability samples 
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from a standard Gaussian distribution with such a small standard deviation will be represented on 
a computer, in effect, as zero. Thus the random variable realisations W", W2 , b" can be approxi¬ 
mated as 

« zJMa, b" « m. 

Note that W" are not maximum a posteriori (MAP) estimates, but random variable realisations. 
This gives us 

y™ « W ^cr(x„(z”Mi) -f m)(z2M2). 

V A 


Scaling the optimisation objective by a positive constant ^ doesn’t change the parameter values at 
its optimum (as long as we don’t optimise with respect to r). We thus scale the objective to get 


1 


N 


/^GP-MC OC -^'^WYn-YnWl “ ^||Mi||2 - ^\\M2\\l - 


P2 


1 


2N 


2tN' 


2tN' 


2tN' 


(16) 


n—1 

and we recovered equation Q for an appropriate setting of r. Maximising eq. ( [T6] l results in the 
same optimal parameters as the minimisation of eq. Note that eq. is a scaled unbiased 
estimator of eq. With correct stochastic optimisation scheduling both will converge to the 

same limit. 


The optimisation of £gp-mc proceeds as follows. We sample realisations z”,Z 2 to evaluate the 
lower-bound and its derivatives. We perform a single optimisation step (for example a single gradient 
descent step), and repeat, sampling new realisations. 


We can make several interesting observations at this point. First, we can find the model precision 
from the identity Ai = 2 ^ which gives t = 2 \In ■ Second, it seems that the weight-decay for the 
dropped-out weights should be scaled by the probability of the weights to not be dropped. Lastly, 
it is known that setting the dropout probability to zero ipi — P 2 = 1) results in a standard NN. 
Following the derivation above, this would result in delta function approximating distributions on 
the weights (replacing eqs. (fTOjl-lfT^). As was discussed in | Lazaro-Gredilla et al. 20101 this leads 
to model over-fitting. Empirically it seems that the Bernoulli approximating distribution is sufficient 
to considerably prevent over-fitting. 


Note that even though our approximating distribution is, in effect, made of a sum of two point 
masses, each point mass with zero variance, the mixture does not have zero variance. It has the 
variance of a Bernoulli random variable, which is transformed through the network. This choice of 
approximating distribution results in the dropout model. 


4 Extensions 


We have presented the derivation for a single hidden layer NN in the task of regression. The deriva¬ 
tion above extends to tasks of classification, flexible priors, mini-batch optimisation, deep models, 
and much more. This will be studied next. 


4.1 Evaluating the Log Evidence Lower Bound for Classification 


For classification we have an additional step in the generative model in eq. (|^ compared to eq. 
<0, sampling class assignment c„ given weight y„. We can write this generative model using the 
auxiliary random variables introduced in section 3.1 for the regression case by 

p(c|X) = J p(c|Y)p(Y|X)dY 


= J p{c\Y)(^J p(Y|X,Wi,W2,b)p(Wi,W2,b)dWidW2db^dY 

where c is an W dimensional vector of categorical values. We can write the log evidence lower 
bound in this case as (proposition]^ in the appendix) 

£gp-vi := J p(Y|X,Wi,W2,b)(7(Wi,W2,b) logp(c|Y)dWidW2dbdY 

-KL(g(Wi,W2,b)||p(Wi,W2,b)). 
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The integrand of the first term can be re-written like before as a sum 

N 

logp(c|Y) = ^logp(c„|y„) 

n—1 

resulting in a log evidence lower bound given by 

N 

Ccp-vi Wi,W 2 ,b) 9 (Wi,W 2 ,b)logp(c„|y„) 

-KL(q(Wi,W2,b)|b(Wi,W2,b)) 

where the integration of each term in the first expression is over Wi, W 2 , b, and y„. 

We can re-parametrise each integrand in the sum following ( [T^ to obtain 

Wi = zi(Mi + crei) + (1 - zi)(Tei, 

W 2 = Z2(M2 + Cre2) + (1 - Z2)(Te2, 

b = m + ere, 

yn = yja(x„Wi+b)W2. 


(17) 


Like before, we estimate each integral using Monte Carlo integration with a distinct single sample 
to obtain: 

N 

CcP-MC := ^ logp(c„|y„(x„, b")) - KL(g(Wi, W 2 , b)| |p(Wi, W 2 , b)) 

n—1 

with realisations y„, W", W 2 , and b". 


Each term in the sum in the first expression can be re-written as 


logp(c„|y„) = - log exp(y„d/)^ . 


We evaluate the second expression as before. Scaling the objective by a positive 
the following maximisation objective. 


this results in 


N 


CcP-MC 


Pn,c 


Pi I 

2iv' 




P2 I 

2iv' 


Moll"- 


1 

m' 


m 


1^, 


n=l 

withpn c„ = logp(c„|yn), identical (up to a sign flip) to that of eqs. (|^, ([^ for appropriate selection 
of weight decay A. 


4.2 Prior Length-scale 


We can define a more expressive prior than Af(0, Ik) over the weights of the first layer Wi. This 
will allow us to incorporate our prior belief over the frequency of the observed data. Instead of 
p(w) = JV{w, 0, Ik) we may use pi{'w) = A/'(w; 0, 1~^Ik) with length-scale L 


To use this more expressive prior we need to adapt proposition [T] approximating the prior term’s KL 
divergence. It is easy to see that the KL divergence between q(xj and pi (x) can be approximated as: 

L 


KL(g(x)||pi(x)) « ^ +tr(/^Si) - iT- log|S*| + Klogl 


i=l 


plus a constant for large enough K. This follows from the KL divergence for multivariate normal 
distributions, and can be repeated for b as well (defining pii (b) = Af (b; 0, 1'~^Ik))- Following the 
previous derivations we obtain the regression objective 


N 


r ^ II ^ ii2 ^^P^ 

Lgp-mc oc ||y„ - y„||2 - 


n—1 


2tN 


\Mi\\l- 


P2 

2tN' 


|M2||2- 




2tN 


m 


2i 


and similarly for classification. 


For high frequency data, setting I and I' to small values would result in a weaker regularisation over 
the weights Mi and m. This leads to larger magnitude weights which can capture high frequency 
data |Gal and Turner! [2015| . 
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(18) 


The length-scale decouples the precision parameter r from the weight-decays A; 

l^Pi 


Ai — 


which results in 


2Nt' 

Ppi 

2N\i 


(19) 


The length-scale is a user specified value that captures our belief over the function frequency. A 
short length-scale I (corresponding to high frequency data) with high precision r (equivalently, small 
observation noise) results in a small weight-decay A - encouraging the model to fit the data well. 
A long length-scale with low precision results in a large weight-decay - and stronger regularisation 
over the weights. This trade-off between the length-scale and model precision results in different 
weight-decay values. 


A similar term to the length-scale can be obtained for M 2 . Eq. (|^ can be rewritten by substituting 
we replace ^a(Wf x + b)w£i with cr(W^x -f b)w^ and replace 0, Ik) 




with 0, ■ This results in the standard network structure often used in many implemen¬ 


tations (without the additional scaling by y ;^). Following the above derivation, we can rewrite the 
KL divergence between g(x) and A/'(0, K~^1k) to obtain the objective 

N j2 

|yn - yn|l2 - “ ?,int7ll^''^2|l2 


r ^ II ^ I |2 Iiiv/r ||2 ||2 

•^GP-MC OC — ^ ||yn — yn|l2 ~ ?r^ll^lll2 ~ — 


2tN' 


27iv' 


m 


( 20 ) 


K here acts in a similar way to I and V. A large number of hidden units results in a stronger 
regularisation, pushing the elements of M 2 to become smaller and smaller. A smaller K will result 
in a weaker regularisation term over M 2 , allowing its elements to take larger values. 


4.3 Mini-batch Optimisation 

We often use mini-batches when optimising eq. 0. This is done by setting eq. Q to 




n£S 


with a random subset of data points S of size M, and similarly for classification. 

Using recent results in stochastic variational inference | Hoffman et al. 2013| we can derive the 
equivalent to this in the GP approximation above. We change the likelihood term — | ^^=1 I ly" ~ 
y „||2 in eq. ( [T5] l to sum over the data points in S alone, and multiply the term by ^ to get an 
unbiased estimator to the original sum. Multiplying equation (fTS]) by 


-L as before we obtain 

tN 


-Egp-mc ~ “2M E 11^” “ “ 




n^S 


2tN 


2tN 


m 


( 21 ) 


recovering eq. ([^ for the mini-batch optimisation case as well. 


4.4 Predictive Log-likelihood 

Given a dataset X, Y and a new data point x* we can calculate the probability of possible output 
values y* using the predictive probability p(y*|x*, X, Y). The log of the predictive likelihood 
captures how well the model fits the data, with larger values indicating better model fit. 

Our predictive log-likelihood can be approximated by Monte Carlo integration of eq. 0 with T 
terms; 

logp(y*|x*,X, Y) = log J p(y*|x*,w)p(a;|X, Y)du; 

»log J p(y*|x*,w)g(a;)dw 

/I ^ 
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with ujt ~ q{<^) ■ 

For regression we have 

logp(y*|x*,X, Y) wlogsumexp^- ^rHy-yt 
with our precision parameter r. 


logT 


1 

2 


log 27r 


1 

2 


logT ^ 


( 22 ) 


Uncertainty quality can be determined from this quantity as well. Excessive uncertainty (large ob¬ 
servation noise, or equivalently small model precision t) results in a large penalty from the last term 
in the predictive log-likelihood. An over-confident model with large model precision compared to 
poor mean estimation results in a penalty from the first term - the distance | |y — yt | P gets amplified 
by T which drives the first term to zero. 


4.5 Going Deeper than a Single Hidden Layer 

We will demonstrate how to extend the derivation above to two hidden layers for the case of regres¬ 
sion. Extension to further layers and classification is trivial. 

We use the deep GP model - feeding the output of one GP to the covariance of the next, in the same 
way the input is used in the covariance of the first GP. However, to match the dropout NN model, 
we have to select a different covariance function for the GPs in the layers following the first one. 
Eor clarity, we denote here all quantities related to the first GP with subscript 1, and as a second 
subscript denote the element index. So (pi^nk denotes the element at row n column k of the variable 
$ 1 , and 4>i „ denotes row n of the same variable. 

We next define the new covariance function K 2 . Let a 2 be some non-linear function, not necessarily 
the same as the one used with the previous covariance function. We define K 2 (x, y) to be 

K 2 (x, y) = ^ y p(b 2 )cr 2 (x -f b 2 )'^cr 2 (y + b 2 )db 2 

with some distribution p(b 2 ) over b 2 G 


We use Monte Carlo integration with one term to approximate the integral above. This results in 

K2(x,y) = ^(j(x-f-b2)^cr(y-f ba) 

K2 

with b 2 ~ p(b 2 ). 


Using K 2 instead as the covariance function of the second Gaussian process yields the following 
generative model. Eirst, we sample the variables for all covariance functions: 

~ p(wi), bi^k ~ p{bi), b2 ~ p(h2) 

with Wi a Q X ^1 matrix, bi a Ki dimensional vector, and b 2 a K 2 dimensional vector. Given 
these variables, we define the covariance functions for the two GPs: 

1 

Ki(x,y) = — ^CTi(w^feX-h6i,fc)cri(w^fcy-h5i,fc) 

^ k=l 

K2(x,y) = ^<T(x + b2)^cr(y Tba) 

Conditioned on these variables, we generate the model’s output; 

Fi |X,Wi,bi-A((0,Ki(X,X)) 
F 2 |X,b 2 ~A(( 0 ,K 2 (Fi,Fi)) 

Y|F2~A((F2,t-1I^). 


We introduce auxiliary random variables W 2 a Ki x K 2 matrix and W 3 a K 2 x D matrix. The 
columns of each matrix distribute according to Af{Q, I). 
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Like before, we write Ki(X,X) = with $i an TV x Ki matrix and K 2 (X,X) = <f> 2 ^’ 2 " 

with $2 an TV X K 2 matrix: 


h,nk = \ -r^Cri(wi + hi^k) 

Ki 


02,nfc = \/ + ^2,fc)- 


We can then write Fi = <i)iW 2 , since 


Ep(W2)(Fi) — ‘hiEp(W2)(W2) — 0 

and 

CoVp(W2)(Fi) = Ep(W2)(FiFf) = <i>iEp(W2)(W2W2^)$f = $ 1 $^ , 
and similarly for F 2 . Note that Fi is an TV x K 2 matrix, and that F 2 is an TV x D matrix. Thus, 


4>2, 


nk 


= ^/—a2(w 


2,k^l,r. 


+ b2,k)- 


Finally, we can write 

y„|X,Wi,bi,W2,b2,W3 

The application of variational inference continues as before. 

Note that an alternative approach would be to use the same covariance function in each layer. For 
that we would need to set W 2 to be of dimensions Ki x Ki normally distributed. This results 
in a product of two normally distributed matrices: W 2 and the weights resulting from the Monte 
Carlo integration of K 2 (denoted W 2 for convenience). Even though the composition of two linear 
transformations is a linear transformation, the resulting prior distribution over the weight matrix 
W 2 W 2 is quite complicated. 


5 Insights and Applications 


Our derivation suggests many applications and insights, including the representation of model un¬ 
certainty in deep learning, better model regularisation, computationally efficient Bayesian convolu¬ 
tional neural networks, use of dropout in recurrent neural networks, and the principled development 
of dropout variants, to name a few. These are briefly discussed here, and studied more in depth in 
separate work. 


5.1 Insights 


The Gaussian process’s robustness to over-fitting can be contributed to several different aspects of 
the model and is discussed in detail in | Rasmussen and Williams 2006| . Our interpretation offers a 
possible explanation to dropout’s ability to avoid over-fitting. Dropout can be seen as approximately 
integrating over the weights of the network. 


Our derivation also suggests that an approximating variational distribution should be placed over 
the bias b. This could be sampled jointly with the weights 'W. Note that it is possible to interpret 
dropout as doing so when used with non-linearities with (t( 0) = 0. This is because the product 
by the vector of Bernoulli random variables can be passed through the non-linearity in this case. 
However the GP interpretation changes in this case, as the inputs are randomly set to zero rather 
than the weights. By sampling Bernoulli variables for the bias weights as well, the model might 
become more robust. 


In | |Srivastava et al. 20141 alternative distributions to the Bernoulli are discussed. For example, it is 
suggested that multiplying the weights by A/^(l, cr^) results in similar results to dropout (although 
this becomes a more costly operation at run time). This can be seen as an alternative approximating 
variational distribution where we set q{'Wk) = mj; + nifecre with e ~ JV{0, 1). 


We noted in the text that the weight-decay for the dropped-out weights should be scaled by the 
probability of the weights to not be dropped. This follows from the KL approximation. We also note 
that the model brought in section [ZT] does not use a bias at the output layer. This is equivalent to 
shifting the data by a constant amount and thus not treated in our derivation. Alternatively, using a 
Gaussian process mean function given by /i(x) = b^ is equivalent to setting the bias of the output 
layer to b^,. 
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5.1.1 Model Calibration 


We can show that the dropout model is not calibrated. This is because Gaussian processes’ un¬ 
certainty is not calibrated and the model draws its properties from these. The Gaussian process’s 
uncertainty depends on the covariance function chosen, which we showed above to be equivalent 
to the non-linearities and prior over the weights. The choice of a GP’s covariance function follows 
from our assumptions about the data. If we believe, for example, that the model’s uncertainty should 
increase far from the data we might choose the squared exponential covariance function. 

For many practical applications this means that model uncertainty can increase with data magnitude 
or be of different scale for different datasets. To calibrate model uncertainty in regression tasks 
we can scale the uncertainty linearly to remove data magnitude effects, and manipulate uncertainty 
percentiles to compare among different datasets. This can be done by fitting a simple distribution 
over the training set output uncertainty, and using the cumulative distribution function to hnd the 
relative ratio of a new data point’s uncertainty to that of existing ones. This quantity can be used to 
compare a data point’s uncertainty obtained from a model trained on one data distribution to another. 

For example, if a test point has standard deviation 5, whereas almost all other data points have 
standard deviation ranging from 0.2 to 2, then the data point will be in the top percentile, and the 
model will be considered as very uncertain about the point compared to the rest of the data. However, 
another model might give the same point standard deviation 5 with most of the data modelled with 
standard deviation ranging from 10 to 15. In this model the data point will be in the lowest percentile, 
and the model will be considered as fairly certain about the point with respect to the rest of the data. 


5.2 Applications 

Our derivation suggests an estimate for dropout models by averaging T forward passes through 
the network (referred to as MC dropout, compared to standard dropout with weight averaging). 
This result has been presented in the literature before as model averaging [ Srivastava et al.| 2014) . 
Our interpretation suggests a new look as to why MC dropout is more sensible than the current 
approach of averaging the weights. Furthermore, with the obtained samples we can estimate the 
model’s conhdence in its predictions and take actions accordingly. For example, in the case of 
classification, the model might return a result with high uncertainty, in which case we might decide 
to pass the input to a human to classify. Alternatively, one can use a weak and fast model to perform 
classihcation, and use a more elaborate but slower model only on inputs for which the weak model in 
uncertain. Uncertainty is important in reinforcement learning (RL) as well | SzepesvM) 2010) . With 
uncertainty information an agent can decide when to exploit and when to explore its environment. 
Recent advances in RL have made use of NNs to estimate agents’ Q-value functions, a function that 
estimates the quality of different states and actions in the environment |Mnih et al. 2013) . Epsilon 
greedy search is often used in this setting, where an agent selects its currently estimated best action 
with some probability, and explores otherwise. With uncertainty estimates over the agent’s Q-value 
function, techniques such as Thompson sampling | Thompson 19331 can be used to train the model 
faster. These ideas are studied in the main paper. 


Following our interpretation, one should apply dropout before each weight layer and not only before 
inner-product layers at the end of the model. This is to avoid parameter over-htting on all layers 
as the dropout model, in effect, integrates over the parameters. The use of dropout before some 
layers but not others corresponds to interleaving MAP estimates and fully Bayesian estimates. The 
application of dropout before every weight layer is not used in practice however, as empirical results 
using standard dropout on some network topologies (after convolutions for example) suggest infe¬ 
rior performance. The use of MC dropout, however, with dropout applied before every weight layer 
results in much better empirical performance on some NN structures. 

One can also interpret the approximation above as approximate variational inference in Bayesian 
neural networks (NNs). Thus, dropout applied before every weight layer is equivalent to variational 
inference in Bayesian NNs. This allows us to develop new Bayesian NN architectures which are 
not directly related to the Gaussian process, using operations such as pooling and convolutions. 
This leads to good, efficient, and trivial approximations to Bayesian convolutional neural networks 
(convnets). We discuss these ideas with empirical evaluation in separate work. 

Another possible application is the adaptation of dropout to recurrent neural networks (RNNs). 
Currently, dropout is not used with these models as the repeated application of noise over potentially 
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thousands of repetitions results in a very weak signal at the output. GP dynamical models [Wang 


|et al.| |2005[ and recursive GPs with perfect integrators correspond to the ideas behind RNNs and 
long-short-term-memory (LSTM) networks |Hochreiter and Schmidhub^ 1997| . The GP models 
integrate over the parameters and thus avoid over-fitting. Seen as a GP approximation one would 
expect there to exist a suitable dropout approximation for these tasks as well. We discuss these ideas 
in separate work. 


In future research we aim to assess model uncertainty on adversarial inputs as well, such as corrupted 
images that classify incorrectly with high confidence [Szegedy et al.[pOT4) . Adding or subtracting a 
single pixel from each input dimension is perceived as almost unchanged input to a human eye, but 
can change classification probabilities considerably. In the high dimensional input space the new 
corrupted image lies far from the data, and one would expect model uncertainty to increase for such 
inputs. 


Lastly, our interpretation allows the development of principled extensions of dropout. The use of 
non-diminishing cr^ (eqs. to ( |T^ ) and the use of a mixture of Gaussians with more than two 
components is an immediate example of such. For example the use of a low rank covariance matrix 
would allow us to capture complex relations between the weights. These approximations could 
result in alternative uncertainty estimates to the ones obtained with MC dropout. This is subject to 
current research. 


6 Conclusions 

We have shown that a neural network with arbitrary depth and non-linearities and with dropout 
applied before every weight layer is mathematically equivalent to an approximation to the deep 
Gaussian process (marginalised over its covariance function parameters). This interpretation offers 
an explanation to some of dropout’s key properties. Our analysis suggests straightforward generali¬ 
sations of dropout for future research which should improve on current techniques. 
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A KL of a Mixture of Gaussians 


Proposition 1. Fix K,L & fi, a probability vector p = (pi, and positive- 

definite for i = 1, L, with the elements of each not dependent on K. Let 

L 

<?(x) = ^p*A/'(x;/Xi,S,) 

be a mixture of Gaussians with L components and /Xj € normally distributed, and let p[x.) = 
The KL divergence between q{x.) and p(x) can be approximated as: 

L 

KL{q{x)\\p{x)) « ^ y(/xf/Xi + fr(S,) - if(l + log27r) - log|S,|) 
plus a constant for large enough K. 


Proof We have 


KL((?(x)||p(x)) = / g(x)log 


<z(x) 


dx 


p(x) 

= J 9 (x) log 9 (x)dx- 


q{x) logp(x)dx 


= -n{q{x)) - J g(x) logp(x)dx 

where H((jf(x)) is the entropy of g(x). The second term in the last line can be evaluated analytically, 
but the entropy term has to be approximated. 


We begin by approximating the entropy term. We write 

'H(( 7 (x)) = -^P^ /'A/'(x;/Xi,S,)logg(x)dx 

= -& 

with LiLf = Si. 

Now, the term inside the logarithm can be written as 

L 

q{Pi + L.Ci) = + Cep, pj, Sj) 

i=i 

= _ 1||^^ - p. - L,e,|||J. 

i=i 

where 11 • 11 s is the Mahalanobis distance. Since /x^, /x^ are assumed to be normally distributed, the 
quantity pj — is also normally distributed. Using the expectation of the generalised yf 

distribution with K degrees of freedom, we have that for iT >> 0 there exists that jj/x^ — /x^ — 
LiCilll; >> 0 for i 7 ^ j (since the elements of Sj do not depend on K). Finally, we have for 

i = j that ll/Xi — Pi — Lieilll;. = ef Lf L~^L“^Liei = ejCi. Therefore the last equation can be 
approximated as 

g(p, + L,e,) «p,( 27 r)"-^/ 2 |£;^|-i/ 2 g^p| _ 

This gives us 

n{q{yi)) K,-^p, J A/'(e,;0,I)log ^p*(27r)"-^/2|S,|~i/2exp{ - ^efe,}^de. 


Afiep 0,1) log q{p, + Ce^dei 
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= ^Y(^log|S*|+ jAf{e,;0,T)efe,de, + K\og2Trj +C 

where C = — Pi log Pi- Since ej distributes according to a distribution, its expectation 
is K, and the entropy can be approximated as 

L 

'H{q{x)) « ^^(log|S,| +K(1 + log27r)) +C 
Next, evaluating the first term of the KL divergence we get 

J 9(x) logp(x)dx = J Af{x; /x,, S,) logp(x)dx 

i—1 

for p(x) = A/'(0, Ik) it is easy to validate that this is equivalent to — ^ Pi{f^T Mi + tr(Si)). 

Finally, we get 

L 

KL(q(x)||p(x)) « +tr(S,) -iF(l + log27r) -log|Si|) - C. 

i=l 


□ 


B Log Evidence Lower Bound for Classification 


Proposition 2. Given 

p(c|X) 


J p{c\Y)p{Y\X)dY 


p(c|Y) 


p(Y|X,Wi,W2,b) 


• p(Wi, W 2 , b)cAVitAV2c/b j dY 

where c is an N dimensional vector of categorical values, one for each observation, we can write 
the log evidence lower bound as 

Ccp-vi-= J p(Y|X,Wi,W 2 ,b)g(Wi,W 2 ,b) 

• logp(c|Y)tAVitfW2cflbifY 

- XL(g(Wi, W 2 , b)||p(Wi, W 2 , b)). 


Proof We have 


logp(c|X) 

= log y'p(c|Y)p(Y|X,Wi,W 2 ,b) 

• p(Wi, W 2 , b)dWidW2dbdY 

= log J g(Wi,W 2 ,b)p(Y|X,Wi,W 2 ,b)p(c|Y) 

p(Wi,W 2 ,b)^„^ 

• /w w , s dWidW2dbdY 

9 (Wi, W 2 ,b) 

> I g(Wi,W 2 ,b)p(Y|X,Wi,W 2 ,b)log(^p(c|Y) 

p(Wi,W 2 ,b)^^„^ 

• /xxr xxr dWidW2dbdY 

9(Wi, W2,b)y 

= / g(Wi,W2,b)p(Y|X,Wi,W2,b) 
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• logp(c|Y)dWidW 2 dbdY 

-KL(g(Wi,W2,b)||p(Wi,W2,b)), 

as needed. □ 


C Predictive Mean 


Proposition 3. Given weights matrices of dimensions Ki x Ki-i, bias vectors m.i of dimen¬ 
sions Ki, and binary vectors Zi of dimensions Ki-i for each layer i = 1, as well as the 
approximating variational distribution 

9(y*|x*) := Af(y*;y* (x*, zi,z^), T"^Ir,)Bern(zi) • • • Bern{zL) 
for some r > 0, with 


y = 


y'^^(MLZL)cr^...y^(M2Z2)cr((MiZi)x* + 


we have 


with 


E, 


«(y*|x*)(y*) « -^y*(x*,Zi,t,...,ZL,t) 


Zi^t ^ Bern{pi). 


Proof 

E 9 (y.|x*)(y*) = y y*9(y*|x*)dy* 

= y y*A/'(y*;y*(x*,zi,...,Zi),T"^l£,)Bern(zi) • • • Bern(zi)dzi • • -dz^dy* 

= y ^y y*A/'(y*;y*(x*,zi,...,ZL),T"^l£,)dy*^Bem(zi) • • • Bern(zL)dzi • • -dz^dy* 

= y y*(x*,zi, ...,Zi)Bern(zi) • • •Bem(zL)dzi • • -dz^ 

1 ^ 
t=l 

□ 


D Predictive Variance 

Proposition 4. Given weights matrices of dimensions Ki x Ki-i, bias vectors of dimen¬ 
sions Ki, and binary vectors Zi of dimensions Ki-i for each layer i = 1, as well as the 

approximating variational distribution 

g(y*|x*) := p{y*\yi\u3)q{u>) 


q{u}) — Bem{zi) ■ ■ ■ Bern{zi^) 

p{y* |x*, w) = ff{y*;y* (x*, zi, zl),t~^Id) 


for some r > 0, with 


r = yj(MLZi)a( 

(M2Z2)cr((Mizi)x*+mi)...y 

we have 


Eg(y.|x*)((y*)^(y*)) 

T 

+ y ^ y* (x* , Zi,t , ...,ZL.t)'^y* (x*, Zi^t 
* _1 

with 

I — J. 


Zi,t -- Bern{pi). 
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Proof. 

E9(y*ix*)((y*)^(y*)) 

= y (y(y*)^(y*)p(y*|x*,u;)dyy g(u;)da; 

= y (^CoVp(y.|x.,^)(y*) +Ep(y.|x.,^)(y*)^Ep(y.|x^a.)(y*)^g(‘*^)dw 

= y + y*(x*, zi,..., ZL)'^y*(x*, zi,..., zl)^B ern(zi) • • • Bern(zL)dzi • • • dz^ 




since p(y*|x*, u;) = AA(y*; ^(x*, zi,..., z^), r-il^). 


□ 


E Detailed Experiment Set-up 


E.l Model Uncertainty in Regression Tasks - Extrapolation 

We ran a stochastic gradient descent optimiser for 1,000,000 iterations (until convergence) with 
learning rate policy base-lr * (1 + 7 * iter)”^’ with 7 = 0.0001,p = 0.25 and momentum 0.9. We 
initialise the bias at 0 and initialise the weights uniformly from [—-ys/fan-in, -ys/fan-in]. We use 
no mini-batch optimisation as the data is fairly small and with high frequencies. The learning rates 
used are 0.01 with weight decay of le“°® for CO 2 (corresponding to a high noise precision of le®). 
This is to model the low observation noise in the data due to the scaling and high frequencies. 

E.2 Model Uncertainty in Reinforcement Learning 

For the purpose of this experiment, we used future rewards discount of 0.7, no temporal window, and 
an experience replay of 30,000. The network starts learning after 1,000 steps, where in the initial 
5,000 steps random actions are performed. The networks consist of two ReLU hidden layers of size 
50, with a learning rate and weight decay of 0.001. Stochastic gradient descent was used with no 
momentum and batch size of 64. 

The original implementation makes use of epsilon greedy exploration with epsilon changing as 

. / / age-burn-in \\ 

e = mm 1, max e^m, 1--——j— -^ 

\ \ steps-total — burn-in / / 

with steps-total of 200,000, burn-in of 3,000, and Cmm = 0.05. 


E.3 Model Uncertainty in Regression Tasks - Interpolation 


For interpolation we repeat the experiment in the main paper with ReLU networks with 5 hidden 
layers and the same setup on a new dataset - solar irradiance. We use base learning rate of 5e“^ and 
weight decay of 5e“^. 


Interpolation results are shown in fig.[2 Fig. la shows interpolation of missing sections (bounded 
between pairs of dashed blue lines) for the Gaussian process with squared exponential covariance 
function, as well the function value on the training set. In red is the observed function, in green are 
the missing sections, and in blue is the model predictive mean. Fig. lb shows the same for the ReLU 
dropout model with 5 layers. 


Both models interpolate the data well, with increased uncertainty over the missing segments. How¬ 
ever the GP’s uncertainty is larger and over-estimates its 95% confidence interval in capturing the 
true function. The observed uncertainty in MC dropout is similar to that of |Gal and Turner 20151 
with the model over-confident in its predictions. Note that this is often observed with variational 
techniques, where model uncertainty is under-estimated. This is because all variational inference 
techniques would under-estimate model uncertainty unless the true posterior is in the class of ap¬ 
proximating variational distributions. Note also that the models correspond to different GP covari¬ 
ance functions that would result in different variance estimations. 
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(b) MC dropout with ReLU non-linearities 


Figure 1: Predictive mean and uncertainties on the reconstructed solar irradiance dataset with 
missing segments, for the GP and MC dropout approximation. In red is the observed function 
and in green are the missing segments. In blue is the predictive mean plus/minus two standard 
deviations of the various approximations. 
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